Chapter 1: Start me up!

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
   The aim of the course is to teach how to use R, RStudio and GitHub as well as how to do statistical analyses with R.
   GitHub repository: https://github.com/phuoc-tn/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Tue Dec  6 04:20:54 2022"
paste("Today I stepped on", sample(1:100, size = 1, replace = FALSE), "LEGO brick(s).")
## [1] "Today I stepped on 83 LEGO brick(s)."


How are you feeling right now?
   Quite tired.

What do you expect to learn?
   How/when/why to do certain statistical analyses. Maybe new tricks to use in R.

Where did you hear about the course?
   My doctoral school informed me in an email.

How did it work as a “crash course” on modern R tools and using RStudio?
   Quite intensive.

Which were your favorite topics?
   Data parsing, using partial matches to find columns.

Which topics were most difficult?
   Transposing data frames.

Some other comments on the book and our new approach of getting started with R Markdown etc.?
   The online book is a great resource! I will definitely bookmark it for future use.


Chapter 2: Regression and model validation

Describe the work you have done this week and summarize your learning.

The main highlight for this week was that I learned how to fit, (somewhat) understand, and interpret linear regression models and their results in R. I also gained more insight into identifying and validating variables and models to explain observed data. Let’s hope that I will get a chance to use these skills in the future.

date()
## [1] "Tue Dec  6 04:20:54 2022"

Analysis exercises

Task 1.

Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.

1) Load tidyverse.

library(tidyverse)

2) Import data from CSV file created in data wrangling.

learning2014 <- read_csv(file = "Data/learning2014.csv", col_names = TRUE)

3) Check structure and dimensions of data.

str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(learning2014)
## [1] 166   7

Answer: The data come from a statistical study of how students’ learning approaches and strategies relate to their achievement in an introductory statistics course taught at the University of Helsinki, Finland. It was sampled from the study’s original data set and consists of seven variables (gender, age, attitude, deep, stra, surf, and points) measured over 166 observations collected via a questionnaire. The attitude variable (adjusted by dividing the original attitude scores by ten) describes the students’ attitude toward learning statistics, measured with ten statements. The variables deep, stra, and surf are abbreviations of the deep (aim to engage with, learn, and understand the course material), strategic (aim to maximize points by whatever approach works), and surface (aim to pass with minimal effort) learning approaches, respectively, and were measured on a 1–5 scale (1 = disagree, 5 = agree).
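The composite 1–5 scores described above are means of Likert-scale questionnaire items, and attitude is the raw score divided by ten. A minimal sketch of that construction, with hypothetical item names (D03, D06, D07), since the original questionnaire columns are not in this file:

```r
# Hypothetical raw questionnaire items (column names are illustrative only;
# the real wrangling script uses the original item names)
raw <- data.frame(
  D03 = c(4, 2), D06 = c(3, 3), D07 = c(5, 1),  # deep-approach Likert items (1-5)
  Attitude = c(37, 31)                          # raw attitude sum, 10-50 scale
)

raw$deep     <- rowMeans(raw[, c("D03", "D06", "D07")])  # average the items
raw$attitude <- raw$Attitude / 10                        # rescale to 1-5
raw[, c("deep", "attitude")]
```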

Task 2.

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

1.1) Use pairs() to visualize data.

pairs(learning2014[-1],
      col = c("red", "blue")[factor(learning2014$gender)], # Use gender for colors.
      oma = c(2, 2, 2, 10)) # Set the margins outside of plot.
par(xpd = TRUE) # Allow legend to be outside of plot area.
legend(0.95, 0.6, # Coordinates for legend position.
       legend = unique(learning2014$gender),
       fill = c("red", "blue"))

Getting the colors and legend right with pairs() was harder than expected. Maybe too hard…

1.2) Or use ggplot2 and GGally to visualize data.

library(tidyverse) # ggplot2 is part of tidyverse.
library(GGally)

ggpairs(
  learning2014,
  mapping = aes(col = gender, alpha = 0.3),
  lower = list(combo = wrap("facethist", bins = 20))
) +
  scale_color_manual(values = c("red", "blue")) + # Assign new colors manually.
  scale_fill_manual(values = c("red", "blue")) + # Same here.
  theme_bw() + # Change default theme to something more clean.
  theme(strip.text = element_text(size = 14)) # Adjust panel title sizes.

2) Summarize data.

learning2014$gender <- as.factor(learning2014$gender) # Changed gender from character to factor for summary.
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Answer: The age of the students ranged from 17 to 55, with most being around 20 years old. Based on the results, age does not seem to correlate with the number of points earned during the statistics course. Although participating female students (n = 110) outnumbered males (n = 56) by roughly 2:1, there were minimal differences in points between the two groups. Males, however, scored higher in attitude on average than females.
   A significant positive correlation was found between attitude towards learning statistics and points, indicating that the former is a good predictor of the latter. The individual learning approaches (deep, surface, and strategic) did not correlate strongly with points. Among them, however, the strategic approach correlates most positively with higher points, which aligns with the general aim of that approach: maximize the number of points by whatever means work. This approach was also slightly more preferred by females than males on average.
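The correlation claims above come from the ggpairs() panels; the same test can be run directly with cor.test(). Since the CSV lives locally, this sketch uses simulated data with a built-in positive association standing in for attitude and points:

```r
set.seed(42)
attitude <- runif(166, 1, 5)                        # simulated attitude scores
points   <- 10 + 4 * attitude + rnorm(166, sd = 4)  # built-in positive association

ct <- cor.test(attitude, points)  # Pearson correlation with a t-test
ct$estimate                       # sample correlation coefficient
ct$p.value                        # small for an association this strong
```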

Task 3.

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.

1) Fit the learning approaches to linear model.

linear_model <- lm(points ~ deep + stra + surf, data = learning2014)

2) Check summary of linear model.

summary(linear_model)
## 
## Call:
## lm(formula = points ~ deep + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1235  -3.0737   0.5226   4.2799  10.3229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.9143     5.1169   5.260  4.5e-07 ***
## deep         -0.7443     0.8662  -0.859   0.3915    
## stra          0.9878     0.5962   1.657   0.0994 .  
## surf         -1.6296     0.9153  -1.780   0.0769 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared:  0.04071,    Adjusted R-squared:  0.02295 
## F-statistic: 2.292 on 3 and 162 DF,  p-value: 0.08016

3) Remove all three approaches, since none of them has a statistically significant relationship with exam points, and refit the model (now intercept-only) as instructed.

new_linear_model <- lm(points ~ 1, data = learning2014)
summary(new_linear_model)
## 
## Call:
## lm(formula = points ~ 1, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7169  -3.7169   0.2831   5.0331  10.2831 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.7169     0.4575   49.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.895 on 165 degrees of freedom

Answer: In short, a linear regression model quantifies the relationship between the explanatory variables, in my case the three learning approaches (deep, strategic, and surface), and the predicted outcome, i.e. the exam points. At a cursory glance at the summary table and the p-values in the Coefficients section, none of the three approaches is statistically significant, meaning no correlation between them and the exam points can be established.

Task 4.

Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model.

Answer: The three learning approaches are not significant explanatory variables for the exam points. This can be seen, for example, in the previously mentioned Coefficients section, which shows the estimated effect and significance of each selected predictor (the learning approaches in this case) on the exam points. The statistical significance of the predictors is marked with symbols next to the p-values, whose meanings are listed immediately below (Signif. codes). The overall significance of the model — or the lack thereof — can also be read from the F-statistic line.
   The estimates in the Coefficients section suggest (interestingly enough) that only the strategic approach is positively associated with higher exam points. This makes sense if students switch between deep and surface learning approaches depending on how familiar they are with a given topic in statistics, in order to maximize points.
   The multiple R² value shows how much of the variance in the outcome the predictors explain in our dataset: the three approaches explain only ca. 4.1% of the variance. The adjusted R², as the name suggests, is penalized so that merely adding more variables does not inflate it. By that measure, only about 2.3% of the variance in exam points is explained by the three approaches.
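Multiple R² can be checked by hand as 1 - RSS/TSS, the share of total variance explained by the model; a small verification on simulated data:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)   # total sum of squares
c(by_hand = 1 - rss / tss, from_summary = summary(fit)$r.squared)
```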

Task 5.

Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.

1) Generate requested plots from the linear model with the three learning approaches.

par(mfrow = c(2, 2),
    mar = c(2, 2, 2, 2)) # Decrease margins around plots.
plot(linear_model, which = c(1, 2, 5), lwd = 2) # lwd = line width.

Answer: The Residuals vs Fitted plot checks the linearity assumption by plotting residuals against fitted values. Comparing the red trend line to the dashed grey reference line shows that the linearity in my model is not the greatest. The Normal QQ-plot shows whether the residuals of the model follow a normal distribution; here they mostly do, apart from the tails. The Residuals vs Leverage plot shows which data points are “influential”, i.e. which points disproportionately affect the fitted model. Data points beyond the Cook’s distance contours (grey dashed lines) would be influential. Because no data points in this model fall outside those lines, none of them are influential.
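The Cook’s distances behind the Residuals vs Leverage panel can also be inspected numerically with cooks.distance(); a sketch on simulated data, using the common rule of thumb that a distance above 1 flags an influential point:

```r
set.seed(7)
x <- rnorm(50)
y <- x + rnorm(50)
fit <- lm(y ~ x)

cd <- cooks.distance(fit)  # one influence value per observation
which(cd > 1)              # influential points by a common rule of thumb
max(cd)                    # largest influence in this simulated sample
```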


Chapter 3: Logistic regression

date()
## [1] "Tue Dec  6 04:21:07 2022"

Analysis exercises

Task 2.

The CSV file for this week’s assignment contains modified data from a study by Prof. Paulo Cortez and Alice Silva, published in 2008 (source). The table holds measurements (obtained via school reports and questionnaires) of variables related to students’ performance in mathematics and Portuguese in secondary education at two Portuguese schools (Gabriel Pereira and Mousinho da Silveira). The following attributes are found in the dataset:

library(tidyverse)

student_alc <- read_csv(file = "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", col_names = TRUE)
colnames(student_alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The variable “high_use” was added during data wrangling: it is TRUE or FALSE depending on whether “alc_use” is greater than two.
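A minimal sketch of how such a variable can be derived (the actual data-wrangling script may differ): alc_use as the mean of workday (Dalc) and weekend (Walc) consumption, and high_use as a logical flag:

```r
# Toy rows; Dalc = workday, Walc = weekend alcohol use on a 1-5 scale
alc <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))

alc$alc_use  <- (alc$Dalc + alc$Walc) / 2  # average consumption
alc$high_use <- alc$alc_use > 2            # TRUE above the cutoff of 2
alc
```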

Task 3.

Because the file contains so many interesting variables to choose from, I decided to let R — and fate — pick the ones I will use for the rest of the exercises. These were:

set.seed(2135) # Seed number for reproducibility, when using random sampling. 

interesting_variables <- sample(colnames(student_alc), size = 4, replace = FALSE)
interesting_variables
## [1] "freetime"   "sex"        "traveltime" "Pstatus"

Free time after school could be connected to high alcohol usage: the more time you have after school, especially as a student between 15 and 22 years old, the more time you have for a drink (or two). Sex might also play a role, because men typically drink more than women. Travel time, which is specified as the time from home to school, is harder to link to high alcohol consumption. Maybe Portuguese students drink during long trips to school? Who knows. ¯\_(ツ)_/¯ Lastly, the parents’ cohabitation status (together or apart) might correlate negatively, since children in two-parent households have both parents looking after them and thus fewer opportunities or reasons to drink.

Task 4.

library(RColorBrewer)
library(patchwork)
library(gridExtra)

interesting_variables <- append(interesting_variables, "high_use")

student_alc %>%
  select(all_of(interesting_variables)) %>% gather() %>%
ggplot(aes(value, fill = value)) +
  geom_bar() +
  scale_fill_brewer(palette = "Paired", guide = "none") +
  theme_bw() +
  theme(strip.text = element_text(size = 11, face = "bold"),
        axis.title = element_text(size = 11, face = "bold")) +
  facet_wrap("key", scales = "free")

Looking at the bar plots for the interesting variables, I notice that the observations for free time are almost normally distributed. Most of the students do not have high alcohol usage. Interestingly, the parents’ cohabitation status is quite skewed, with most parents living together. The sex distribution is almost equal, and most students have a short travel time to school.

library(ggpubr)

new_student_alc <- student_alc %>%
  select(all_of(interesting_variables))

plot_a <-
  ggplot(new_student_alc, aes(x = high_use, y = freetime, fill = sex)) +
  geom_boxplot() +
  scale_fill_brewer(
    palette = "Dark2",
    name = "Sex",
    labels = c("Female", "Male")
  ) +
  scale_y_discrete(limits = c("very low", "low", "medium", "high", "very high")) +
  ylab("Free time\n(after school)") +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )

plot_b <-
  ggplot(new_student_alc, aes(x = high_use, y = traveltime, fill = sex)) +
  geom_boxplot() +
  scale_fill_brewer(
    palette = "Dark2",
    name = "Sex",
    labels = c("Female", "Male")
  ) +
  scale_y_discrete(limits = c("<15 min", "15–30 min", "30–60 min", ">60 min")) +
  ylab("Travel time\n(home to school)") +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )

plot_c <-
  ggplot(new_student_alc, aes(x = high_use, y = freetime, fill = Pstatus)) +
  geom_boxplot() +
  scale_fill_brewer(
    palette = "Paired",
    name = "Parent's\ncohabitation\nstatus",
    labels = c("Apart", "Together")
  ) +
  scale_y_discrete(limits = c("very low", "low", "medium", "high", "very high")) +
  ylab("Free time\n(after school)") +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )

plot_d <-
  ggplot(new_student_alc, aes(x = high_use, y = traveltime, fill = Pstatus)) +
  geom_boxplot() +
  scale_fill_brewer(
    palette = "Paired",
    name = "Parent's\ncohabitation\nstatus",
    labels = c("Apart", "Together")
  ) +
  scale_y_discrete(limits = c("<15 min", "15–30 min", "30–60 min", ">60 min")) +
  ylab("Travel time\n(home to school)") +
  theme_bw() +
  theme(
    axis.title.x = element_blank(),
    axis.title = element_text(face = "bold"),
    legend.title = element_text(face = "bold")
  )

joined_plot <- (plot_a + plot_b + plot_layout(guides = "collect")) /
  (plot_c + plot_d + plot_layout(guides = "collect")) +
  plot_annotation(tag_levels = "A") &
  theme(plot.tag = element_text(face = "bold"))

grid.arrange(
  patchworkGrob(joined_plot),
  bottom = text_grob("High alcohol consumption", face = "bold")
)

Looking at the box plots, there does not seem to be much difference in alcohol consumption by travel time, for either sex or the parents’ cohabitation status (panels B and D). With free time (panels A and C), however, there seems to be a small increase in consumption with more free time, which somewhat aligns with my assumptions. The increase also appears larger on average for males (panel A), in line with my initial hypotheses. Curiously, I notice that students with separated parents have, on average, less free time but higher alcohol consumption. This contradicts my hypotheses.

Task 5.

model <- glm(high_use ~ freetime + sex + traveltime + Pstatus - 1, data = student_alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ freetime + sex + traveltime + Pstatus - 
##     1, family = "binomial", data = student_alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3632  -0.8695  -0.6220   1.1984   1.9127  
## 
## Coefficients:
##            Estimate Std. Error z value Pr(>|z|)    
## freetime     0.3023     0.1253   2.413   0.0158 *  
## sexF        -2.6603     0.5946  -4.474 7.69e-06 ***
## sexM        -1.8937     0.6263  -3.023   0.0025 ** 
## traveltime   0.4015     0.1617   2.483   0.0130 *  
## PstatusT    -0.1924     0.3912  -0.492   0.6228    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 512.93  on 370  degrees of freedom
## Residual deviance: 424.31  on 365  degrees of freedom
## AIC: 434.31
## 
## Number of Fisher Scoring iterations: 4

The summary shows that Pstatus might not be statistically significant in my model.

OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## freetime   1.35291603 1.06175190 1.7370368
## sexF       0.06992947 0.02089100 0.2168783
## sexM       0.15051787 0.04259748 0.5008024
## traveltime 1.49409723 1.08919695 2.0594122
## PstatusT   0.82496011 0.38967324 1.8280994

Pstatus has the widest confidence interval. Its lower bound is below 1 and its upper bound above 1, meaning there is no statistically significant association with high alcohol usage; this is consistent with the model summary above. Free time and travel time have statistically significant positive associations (both bounds > 1), which aligns with my assumptions, whereas sex has a negative one (both bounds < 1), which contradicts them.
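Odds ratios are exponentiated logistic-regression coefficients, so a CI straddling 1 on the odds-ratio scale corresponds to a coefficient CI straddling 0; a sketch on simulated data:

```r
set.seed(3)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.8 * x))  # outcome with a true positive effect of x

m  <- glm(y ~ x, family = "binomial")
OR <- exp(coef(m))     # odds ratios
CI <- exp(confint(m))  # profile-likelihood CIs, on the odds-ratio scale
cbind(OR, CI)
```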

library(finalfit)

factor_student_alc <-
  new_student_alc %>%
  mutate(sex = factor(sex) %>%
           fct_recode("Female" = "F",
                      "Male"   = "M") %>%
           ff_label("Sex")) %>%
  mutate(
    Pstatus = factor(Pstatus) %>%
      fct_recode("Apart" = "A",
                 "Together"   = "T") %>%
      ff_label("Parent's cohabitation status")
  ) %>%
  mutate(
    traveltime = factor(traveltime) %>%
      fct_recode(
        "<15 min" = "1",
        "15–30 min" = "2",
        "30–60 min" = "3",
        ">60 min" = "4"
      ) %>%
      ff_label("Travel time (home to school)")
  ) %>%
  mutate(
    freetime = factor(freetime) %>%
      fct_recode(
        "very low" = "1",
        "low" = "2",
        "medium" = "3",
        "high" = "4",
        "very high"   = "5"
      ) %>%
      ff_label("Free time (after school)")
  ) %>%
  mutate(high_use = ff_label(high_use, "High alcohol consumption"))

factor_student_alc %>%
  summary_factorlist(
    "high_use",
    c("freetime", "sex", "traveltime", "Pstatus"),
    p = TRUE,
    add_dependent_label = TRUE
  )
## Warning in chisq.test(traveltime, high_use): Chi-squared approximation may be
## incorrect
##  Dependent: High alcohol consumption                FALSE      TRUE      p
##             Free time (after school)  very low   15 (5.8)   2 (1.8)  0.032
##                                            low  46 (17.8) 14 (12.6)       
##                                         medium 112 (43.2) 40 (36.0)       
##                                           high  65 (25.1) 40 (36.0)       
##                                      very high   21 (8.1) 15 (13.5)       
##                                  Sex    Female 154 (59.5) 41 (36.9) <0.001
##                                           Male 105 (40.5) 70 (63.1)       
##         Travel time (home to school)   <15 min 174 (67.2) 68 (61.3)  0.003
##                                      15–30 min  73 (28.2) 26 (23.4)       
##                                      30–60 min   10 (3.9)  11 (9.9)       
##                                        >60 min    2 (0.8)   6 (5.4)       
##         Parent's cohabitation status     Apart  26 (10.0) 12 (10.8)  0.970
##                                       Together 233 (90.0) 99 (89.2)

Using summary_factorlist() from the finalfit package reveals the same.

factor_student_alc %>%
  finalfit("high_use",
           c("sex", "Pstatus", "traveltime", "freetime"),
           metrics = TRUE)
## [[1]]
##  Dependent: High alcohol consumption                FALSE      TRUE
##                                  Sex    Female 154 (79.0) 41 (21.0)
##                                           Male 105 (60.0) 70 (40.0)
##         Parent's cohabitation status     Apart  26 (68.4) 12 (31.6)
##                                       Together 233 (70.2) 99 (29.8)
##         Travel time (home to school)   <15 min 174 (71.9) 68 (28.1)
##                                      15–30 min  73 (73.7) 26 (26.3)
##                                      30–60 min  10 (47.6) 11 (52.4)
##                                        >60 min   2 (25.0)  6 (75.0)
##             Free time (after school)  very low  15 (88.2)  2 (11.8)
##                                            low  46 (76.7) 14 (23.3)
##                                         medium 112 (73.7) 40 (26.3)
##                                           high  65 (61.9) 40 (38.1)
##                                      very high  21 (58.3) 15 (41.7)
##                OR (univariable)             OR (multivariable)
##                               -                              -
##    0.19 (0.10 to 0.28, p<0.001)   0.15 (0.06 to 0.25, p=0.002)
##                               -                              -
##  -0.02 (-0.17 to 0.14, p=0.823) -0.06 (-0.21 to 0.09, p=0.464)
##                               -                              -
##  -0.02 (-0.12 to 0.09, p=0.734) -0.02 (-0.12 to 0.09, p=0.724)
##    0.24 (0.04 to 0.45, p=0.019)   0.25 (0.05 to 0.45, p=0.015)
##    0.47 (0.15 to 0.79, p=0.004)   0.42 (0.10 to 0.73, p=0.010)
##                               -                              -
##   0.12 (-0.13 to 0.36, p=0.355)  0.08 (-0.16 to 0.32, p=0.525)
##   0.15 (-0.08 to 0.37, p=0.212)  0.11 (-0.11 to 0.33, p=0.334)
##    0.26 (0.03 to 0.50, p=0.027)  0.22 (-0.01 to 0.45, p=0.066)
##    0.30 (0.04 to 0.56, p=0.026)  0.21 (-0.06 to 0.47, p=0.124)
## 
## [[2]]
##                                                                                                                                                   
##  Number in dataframe = 370, Number in model = 370, Missing = 0, Log-likelihood = -218.3, AIC = 458.6, R-squared = 0.093, Adjusted R-squared = 0.07
factor_student_alc %>%
  or_plot(
    "high_use",
    c("sex", "Pstatus", "traveltime", "freetime"),
    title_text_size = 13,
    table_text_size = 4,
    column_space = c(-3, 3.2, 7.5),
    breaks = c(0.5, 1, 2.5, 5, 10, 25, 50)
  )

Task 6.

Based on the previous model results, I will exclude Pstatus from further analyses.

model <- glm(high_use ~ sex + freetime + traveltime - 1, data = factor_student_alc, family = "binomial")
probabilities <- predict(model, type = "response")
factor_student_alc <- factor_student_alc %>%
  mutate(probability = probabilities,
         prediction = probability > 0.5)
table(high_use = factor_student_alc$high_use, prediction = factor_student_alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252    7
##    TRUE     97   14
table(high_use = factor_student_alc$high_use,
      prediction = factor_student_alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68108108 0.01891892 0.70000000
##    TRUE  0.26216216 0.03783784 0.30000000
##    Sum   0.94324324 0.05675676 1.00000000
ggplot(factor_student_alc,
       aes(x = probability, y = high_use, col = prediction)) +
  geom_point() +
  theme_bw()

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = factor_student_alc$high_use, prob = factor_student_alc$probability)
## [1] 0.2810811

According to the loss function, the average proportion of wrong predictions in my final model is 28.1% (the training error), which probably isn’t that great.
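The loss function above is simply the misclassification rate: a prediction counts as wrong when its probability falls on the other side of 0.5 from the true class. A toy check:

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # wrong side of the 0.5 cutoff
  mean(n_wrong)                       # share of misclassified observations
}

# One wrong prediction out of four: the third probability (0.4) is on the
# wrong side of 0.5 for a true class of 1
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.4, 0.1))  # 0.25
```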


Chapter 4: Clustering and classification

date()
## [1] "Tue Dec  6 04:21:16 2022"

Analysis exercises

Task 2.

The data set used in this exercise, titled “Boston”, contains 14 variables (columns) and 506 observations (rows). It contains information about housing values, used to evaluate people’s willingness to pay for cleaner air in the Boston metropolitan area in the 1970s (source).

library(MASS)
data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Task 3.

Looking at the plots and the summary table, most variables, such as crim, zn, and rad, are clearly skewed rather than normally distributed. Many of the variables also correlate strongly (both positively and negatively) with each other based on the plots. The weakest correlating variable seems to be chas, i.e. the Charles River dummy variable, which may be because it is binary (0 or 1). Interestingly, indus, i.e. the proportion of non-retail business acres per town, and tax, i.e. the full-value property-tax rate per $10,000, are bimodally distributed.

library(GGally)

ggpairs(Boston, lower = list(
  continuous = wrap(
    "smooth",
    alpha = 0.3,
    size = 0.5,
    color = "firebrick1"
  )
)) + theme_bw()

library(tidyverse)
library(corrplot)

cor_matrix <- cor(Boston) %>% round(digits = 2)

corrplot(
  cor_matrix,
  method = "circle",
  type = "lower",
  cl.pos = "b",
  tl.pos = "d",
  tl.cex = 0.6
)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Task 4.

The original Boston data set had variables measured in different ways and on different scales/magnitudes. After using scale(), all variables have been standardized to mean 0 and standard deviation 1, which allows for better comparison between them.
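scale() centers each column by its mean and divides by its standard deviation, so every standardized variable ends up with mean 0 and sd 1; a quick check on a toy vector:

```r
x <- c(2, 4, 6, 8, 10)
z <- as.numeric(scale(x))  # scale() returns a matrix; drop to a plain vector

all.equal(z, (x - mean(x)) / sd(x))  # TRUE: centered, then divided by sd
c(mean = mean(z), sd = sd(z))        # 0 and 1 (up to floating point)
```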

boston_scaled <- Boston %>% scale()

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
boston_scaled_df <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled_df$crim)

crime <-
  cut(
    boston_scaled_df$crim,
    breaks = bins,
    include.lowest = TRUE
  )

boston_scaled_df <- boston_scaled_df %>% dplyr::select(-crim)
boston_scaled_df <- data.frame(boston_scaled_df, crime)

levels(boston_scaled_df$crime) <- c("low", "med_low", "med_high", "high")

n <- nrow(boston_scaled_df)

set.seed(2126)
ind <- sample(n,  size = n * 0.8)

train <- boston_scaled_df[ind,]
test <- boston_scaled_df[-ind,]

Task 5.

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <-
  function(x,
           myscale = 1,
           arrow_heads = 0.1,
           color = "black",
           tex = 0.75,
           choices = c(1, 2)) {
    heads <- coef(x)
    arrows(
      x0 = 0,
      y0 = 0,
      x1 = myscale * heads[, choices[1]],
      y1 = myscale * heads[, choices[2]],
      col = color,
      length = arrow_heads
    )
    text(
      myscale * heads[, choices],
      labels = row.names(heads),
      cex = tex,
      col = color,
      pos = 3
    )
  }

classes <- as.numeric(train$crime)

plot(lda.fit,
     dimen = 2,
     col = c("red", "blue", "purple", "gold")[classes])
lda.arrows(lda.fit, myscale = 2)

Task 6.

The table shows that the majority of the model's predictions are correct overall (74 out of 102, or 72.55%). Looking at the table in more detail, of the observations predicted into each class, 16 out of 25 (64.0%) predicted as low crime rate, 16 out of 28 (57.14%) predicted as medium-low, 12 out of 18 (66.67%) predicted as medium-high, and 30 out of 31 (96.77%) predicted as high were correct. The model is thus highly accurate in predicting high crime rates, but less so for the lower categories, especially medium-low.

correct_classes <- test$crime
new_test <- test %>% dplyr::select(-crime)

lda.pred <- predict(lda.fit, newdata = new_test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       5        1    0
##   med_low    8      16        5    0
##   med_high   1       7       12    1
##   high       0       0        0   30
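The percentages above can be computed directly from the cross-tabulation. Here the printed table is re-entered by hand as a matrix (an illustration of the arithmetic, not a re-run of the model):

```r
# Confusion matrix copied from the printed output above
# (rows = correct class, columns = predicted class).
conf <- matrix(
  c(16,  5,  1,  0,
     8, 16,  5,  0,
     1,  7, 12,  1,
     0,  0,  0, 30),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    correct   = c("low", "med_low", "med_high", "high"),
    predicted = c("low", "med_low", "med_high", "high")
  )
)

# Overall accuracy: correct predictions sit on the diagonal.
accuracy <- sum(diag(conf)) / sum(conf)

# Per predicted class: share of each column's predictions that were correct.
per_class <- diag(conf) / colSums(conf)

round(accuracy, 4)   # 0.7255
round(per_class, 4)
```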

Task 7.

As it was not specified which distances we were supposed to compute, I computed both the Euclidean and Manhattan distances. For the initial k-means analysis, I arbitrarily picked three clusters. After determining the optimal number of clusters from the total within-cluster sum of squares (TWCSS), I re-ran the analysis with two clusters, since the steepest drop in the line plot occurs between one and two clusters. Scaling the Boston data seems to do something odd to the distribution of the rad variable; I am not sure why. In any case, looking at the new distributions under the two-cluster k-means solution, the indus, nox, and tax variables each show two differently colored peaks. This makes sense, as I previously noted that these variables are bimodally distributed, so I would assume the scaled data has two centroids.

data("Boston")

new_boston_scaled <- as.data.frame(scale(Boston))
dist_eu <- dist(new_boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
dist_man <- dist(new_boston_scaled, method = "manhattan")
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
library(GGally) # Needed for ggpairs() below.
library(RColorBrewer)

k_means <- kmeans(new_boston_scaled, centers = 3)

ggpairs(
  new_boston_scaled, # Already a data frame, so no conversion is needed.
  aes(color = as.factor(k_means$cluster)),
  lower = list(continuous = wrap(
    "smooth",
    alpha = 0.3,
    size = 0.5
  )),
  upper = list(continuous = wrap("cor", size = 2))
) +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") +
  theme_bw()

twcss <- sapply(1:10, function(k){kmeans(new_boston_scaled, k)$tot.withinss})

qplot(x = 1:10, y = twcss, geom = "line") +
  scale_x_continuous(breaks = c(1:10)) +
  labs(x = "clusters") +
  theme_bw()
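The "steepest drop" reading of the elbow plot can also be checked numerically. This sketch uses the built-in iris measurements (not Boston, so it runs stand-alone) and picks the point where the TWCSS decrease is largest:

```r
set.seed(2126)
data_scaled <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..10, as in the chunk above.
# nstart = 20 restarts k-means from several random initializations for stability.
twcss <- sapply(1:10, function(k) {
  kmeans(data_scaled, centers = k, nstart = 20)$tot.withinss
})

# diff(twcss) gives the drop between consecutive cluster counts; the most
# negative entry marks the elbow candidate.
drops <- diff(twcss)
which.min(drops)  # index 1: the steepest drop is from one to two clusters
```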

new_k_means <- kmeans(new_boston_scaled, centers = 2)

ggpairs(
  new_boston_scaled,
  aes(color = as.factor(new_k_means$cluster)),
  lower = list(continuous = wrap(
    "smooth",
    alpha = 0.3,
    size = 0.5
  )),
  upper = list(continuous = wrap("cor", size = 2))
) +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") +
  theme_bw()


Chapter 5: Dimensionality reduction techniques

date()
## [1] "Tue Dec  6 04:22:44 2022"

Analysis exercises

Overview of the data

At a glance, most variables follow a skewed (positively or negatively) roughly normal distribution; only Edu.Exp, i.e. expected years of education, is nearly symmetrically distributed. Certain variables, such as gross national income (GNI), maternal mortality ratio (Mat.Mor), and adolescent birth rate (Ado.Birth), have a wide range between their minimum and maximum values, with means well above their medians. This suggests uneven, highly skewed distributions. Curiously, the highest GNI (123 124), quite an outlier, belongs to Qatar. I wonder whether this has anything to do with its foreign aid/investments, oil and gas exports, and the FIFA World Cup it is currently hosting.
   The strongest positive and statistically significant correlation is between expected years of education and life expectancy at birth (Life.Exp; 0.789), while the strongest negative correlation is between maternal mortality ratio (Mat.Mor) and life expectancy at birth (−0.857). The former makes sense: where life expectancy is high, people are unlikely to be struggling to survive and can afford to spend more time getting a degree. The latter is also sensible: if a mother dies in childbirth, the child's chances of survival decrease, especially in developing countries.

library(tidyverse)
library(GGally)

human_data <-
  read.csv("Data/new_human.csv",
           header = T,
           row.names = "X")

custom_scatter_plots <-
  function(data, mapping, method = "lm", ...) {
    ggplot(data = data, mapping = mapping) +
      geom_point(colour = "dodgerblue",
                 alpha = 0.3,
                 size = 1) +
      geom_smooth(method = method,
                  color = "blueviolet",
                  linewidth = 0.5,
                  ...)
  }

ggpairs(
  human_data,
  lower = list(continuous = wrap(custom_scatter_plots)),
  upper = list(continuous = wrap("cor", size = 4, color = "black"))
) +
  theme_bw() +
  theme(
    strip.text = element_text(size = 11, color = "black", face = "bold"),
    axis.text.x = element_text(
      angle = 45,
      hjust = 1,
      vjust = 1
    )
  )

summary(human_data)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Principal component analysis (PCA)

The results for the standardized and non-standardized human data are quite different. The loadings (indicated with red arrows) in the PCA of the non-standardized data reveal that the variables are measured on very different scales. Gross national income (GNI) and maternal mortality ratio (Mat.Mor) are measured in the thousands, while the other variables are either ratios or measured, at most, in the hundreds. As a result, these two variables dominate the analysis and create the boomerang-shaped distribution in the non-standardized plot. The PCA dimensions also look odd: dimension 1 explains 99.99% of the variance, and dimension 2 only 0.01%. Standardizing amends this by putting the variables on the same, comparable scale, which can be seen, e.g., in the more reasonable loadings (dimension 1 explains 53.6% and dimension 2 16.24% of the variance) and in the more circular distribution of the data points.
   The clustering of countries in the PCA of the standardized human data is interesting. Western nations (in Europe and North America) as well as Australia and New Zealand cluster towards female-to-male labor force participation rate (Labo.FM), female percent representation in parliament (Parli.F), expected years of education (Edu.Exp), GNI, the female-to-male ratio in secondary education (Edu2.FM), and life expectancy at birth (Life.Exp). This is reasonable, as these nations are known for stability, safety, and upholding human rights.
   Curiously, many of the Caribbean and Latin American countries cluster mainly towards Parli.F and Labo.FM. This suggests that these countries have many women in parliament and in the workforce, but are neither the wealthiest (based on the GNI loading) nor the safest (based on, e.g., Mat.Mor or Life.Exp). The high values in Parli.F and Labo.FM might be due to poor families sending all available family members to work out of necessity.
   The Sub-Saharan African countries, about half of the East Asian countries, and most of the South Asian countries cluster mainly due to Mat.Mor and Ado.Birth. Considering that these are mostly poor, developing countries with a less-than-great human rights record, it is not surprising that women in these regions give birth young and die more often in childbirth, most likely because of the former.
   Lastly, most of the countries in the Middle East and North Africa cluster towards Edu.Exp, GNI, Edu2.FM, and Life.Exp. The high GNI values are sensible due to, for example, oil and gas exports, whilst for the other variables the reasons are unclear (at least to me at the time of writing).
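The variance percentages quoted above are simply each component's squared standard deviation over the total. A minimal, self-contained sketch using the built-in iris data (not the human data):

```r
# Proportion of variance explained by each principal component is
# sdev^2 / sum(sdev^2); sketched here on the built-in iris measurements.
pca <- prcomp(scale(iris[, 1:4]))
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Matches the "Proportion of Variance" row of summary(pca).
round(100 * prop_var, 2)
```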

library(ggfortify)
library(ggthemes)
library(countrycode)
library(patchwork)

pca_human <- prcomp(human_data)
new_human_data <-
  human_data %>% mutate(Region = countryname(row.names(human_data), destination = "region"))

pca1 <- autoplot(
  pca_human,
  data = new_human_data,
  colour = "Region",
  label = TRUE,
  label.size = 4,
  label.repel = TRUE,
  loadings = TRUE,
  loadings.label = TRUE,
  loadings.label.size = 4,
  loadings.label.repel = TRUE,
) +
  scale_color_manual(values = ptol_pal()(length(unique(
    new_human_data$Region
  ))),
  guide = guide_legend(override.aes = list(size = 3))) +
  geom_hline(yintercept = 0,
             linetype = "dashed",
             alpha = 0.4) +
  geom_vline(xintercept = 0,
             linetype = "dashed",
             alpha = 0.4) +
  ggtitle("Non-standardized human data") +
  theme_bw()

pca_human_scaled <- prcomp(scale(human_data))

pca2 <- autoplot(
  pca_human_scaled,
  data = new_human_data,
  colour = "Region",
  label = TRUE,
  label.size = 4,
  label.repel = TRUE,
  loadings = TRUE,
  loadings.label = TRUE,
  loadings.label.size = 4,
  loadings.label.repel = TRUE
) +
  scale_color_manual(values = ptol_pal()(length(unique(
    new_human_data$Region
  ))),
  guide = guide_legend(override.aes = list(size = 3))) +
  geom_hline(yintercept = 0,
             linetype = "dashed",
             alpha = 0.4) +
  geom_vline(xintercept = 0,
             linetype = "dashed",
             alpha = 0.4) +
  ggtitle("Standardized human data") +
  theme_bw()

pca1 + pca2 + plot_layout(guides = "collect") &
  theme(legend.text = element_text(color = "black", size = 10))

Principal component analysis (PCA) of standardized and non-standardized human data. In the left panel, gross national income (GNI) and maternal mortality ratio (Mat.Mor) pull the data in two directions: affluent oil-exporting countries, such as Qatar, the United Arab Emirates, and the United States, and banking countries, such as Luxembourg and Switzerland, cluster towards the former, whilst many of the Sub-Saharan African countries are pulled towards the latter. Dimension 1 explains 99.99% of the variance, and dimension 2 explains 0.01%. In the right panel, there are overlapping region clusters pulled in several directions based on the loadings: countries in Sub-Saharan Africa cluster towards Mat.Mor, adolescent birth rate (Ado.Birth), female percent representation in parliament (Parli.F), and female-to-male labour force participation rate (Labo.FM); several Latin American and Caribbean countries towards Labo.FM and Parli.F; European, North American, and Pacific countries towards Labo.FM, Parli.F, expected years of education (Edu.Exp), GNI, female-to-male ratio in secondary education (Edu2.FM), and life expectancy at birth (Life.Exp); and lastly, countries in South, East and Central Asia, the Middle East, and North Africa towards Edu.Exp, GNI, Edu2.FM, and Life.Exp. Dimensions 1 and 2 explain 53.6% and 16.24% of the variance, respectively.

Multiple Correspondence Analysis (MCA)

I decided to run the MCA on six randomly chosen variables: escape.exoticism, where, dinner, Tea, resto, and pub. Looking at the eigenvalues in the summary, Dim.1 and Dim.2 explain 19.761% and 14.687% of the variance, respectively. In the Categories section, the ctr column of Dim.1 shows five categories with the highest contributions (ctr > 12.0): chain store+tea shop, tea shop, green, resto, and pub.
   The plots show associations between categories. For example, dinner and tea shop lie near each other, suggesting a strong association, which would make sense if people who buy their tea from tea shops also tend to drink it at dinner.

# Load tea data.
tea_data <-
  read.csv(
    "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv",
    stringsAsFactors = TRUE
  )

# Show data structure.
str(tea_data)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# Plot data.
tea_data %>%
  dplyr::select(-age) %>% # Removed age due to age_Q already existing as factor.
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value, fill = name)) +
  geom_bar() +
  guides(fill = "none") +
  theme_bw() +
  theme(axis.text.x = element_text(
    angle = 45,
    hjust = 1,
    size = 8
  )) +
  facet_wrap("name", scales = "free")

library(FactoMineR)

# Pick randomly six variables.
set.seed(2133)
interesting_vars <- sample(colnames(tea_data), 6)
new_tea_data <- tea_data %>% dplyr::select(all_of(interesting_vars))

# Run MCA.
mca_tea_data <- MCA(new_tea_data, graph = FALSE)
# summary(mca_tea_data, nbelements = Inf) # nbelements = Inf prints whole summary with its 300 individuals.
summary(mca_tea_data)
## 
## Call:
## MCA(X = new_tea_data, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.263   0.196   0.182   0.164   0.153   0.136   0.123
## % of var.             19.761  14.687  13.629  12.290  11.484  10.230   9.258
## Cumulative % of var.  19.761  34.448  48.076  60.366  71.850  82.080  91.338
##                        Dim.8
## Variance               0.115
## % of var.              8.662
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                    |  0.260  0.085  0.078 | -0.298  0.151  0.102 |  0.709
## 2                    |  0.133  0.022  0.020 | -0.609  0.632  0.411 |  0.247
## 3                    |  0.123  0.019  0.005 |  0.670  0.764  0.147 | -0.241
## 4                    |  0.454  0.260  0.076 |  0.166  0.047  0.010 | -0.954
## 5                    |  0.059  0.004  0.007 | -0.524  0.468  0.563 | -0.425
## 6                    |  0.580  0.425  0.127 |  0.477  0.388  0.086 | -0.493
## 7                    |  0.185  0.043  0.076 | -0.213  0.077  0.100 |  0.036
## 8                    |  0.260  0.085  0.078 | -0.298  0.151  0.102 |  0.709
## 9                    | -0.133  0.022  0.021 |  0.153  0.040  0.028 | -0.030
## 10                   | -0.058  0.004  0.003 |  0.067  0.008  0.004 |  0.642
##                         ctr   cos2  
## 1                     0.922  0.578 |
## 2                     0.112  0.068 |
## 3                     0.107  0.019 |
## 4                     1.671  0.339 |
## 5                     0.332  0.371 |
## 6                     0.445  0.091 |
## 7                     0.002  0.003 |
## 8                     0.922  0.578 |
## 9                     0.002  0.001 |
## 10                    0.756  0.330 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## escape-exoticism     |  -0.205   1.258   0.038  -3.360 |  -0.436   7.646
## Not.escape-exoticism |   0.184   1.131   0.038   3.360 |   0.392   6.872
## chain store          |   0.124   0.623   0.027   2.860 |  -0.464  11.748
## chain store+tea shop |  -0.855  12.027   0.257  -8.765 |   0.505   5.649
## tea shop             |   1.429  12.924   0.227   8.239 |   1.658  23.410
## dinner               |   1.130   5.656   0.096   5.362 |   1.704  17.301
## Not.dinner           |  -0.085   0.426   0.096  -5.362 |  -0.128   1.302
## black                |  -0.055   0.048   0.001  -0.546 |  -0.208   0.905
## Earl Grey            |  -0.284   3.278   0.145  -6.592 |   0.019   0.019
## green                |   1.784  22.141   0.393  10.844 |   0.356   1.189
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## escape-exoticism       0.171  -7.142 |  -0.622  16.795   0.348 -10.196 |
## Not.escape-exoticism   0.171   7.142 |   0.559  15.094   0.348  10.196 |
## chain store            0.383 -10.707 |   0.054   0.170   0.005   1.242 |
## chain store+tea shop   0.090   5.179 |  -0.117   0.325   0.005  -1.197 |
## tea shop               0.306   9.559 |  -0.041   0.015   0.000  -0.237 |
## dinner                 0.219   8.084 |  -1.258  10.166   0.119  -5.970 |
## Not.dinner             0.219  -8.084 |   0.095   0.765   0.119   5.970 |
## black                  0.014  -2.055 |   1.228  34.109   0.494  12.149 |
## Earl Grey              0.001   0.434 |  -0.492  14.290   0.437 -11.429 |
## green                  0.016   2.167 |   0.125   0.157   0.002   0.758 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## escape.exoticism     | 0.038 0.171 0.348 |
## where                | 0.404 0.479 0.006 |
## dinner               | 0.096 0.219 0.119 |
## Tea                  | 0.403 0.025 0.529 |
## resto                | 0.384 0.051 0.080 |
## pub                  | 0.256 0.231 0.008 |
par(mfrow = c(1, 2)) # Plot figures next to each other.

plot.MCA(
  mca_tea_data,
  invisible = "ind",
  graph.type = "classic",
  habillage = "quali",
  title = "MCA factor map"
)

plot.MCA(
  mca_tea_data,
  invisible = "ind",
  graph.type = "classic",
  habillage = "quali",
  selectMod = "contrib 5",
  # Show the five categories that contribute the most (ctr > 12.0).
  title = "MCA factor map with most contributing categories"
)


(More chapters to be added…)